Intermediate Importing Data in R

R Markdown

Relational Databases

What is a relational database?
How to connect?
How to read table?

Database Management System

DBMS
Open Source
- MySQL, PostgreSQL, SQLite
Proprietary
- Oracle Database, Microsoft SQL Server
SQL = Structured Query Language

Databases in R

Different R packages
MySQL: RMySQL
PostgresSQL: RPostgresSQL
Oracle Database: ROracle
Conventions specified in DBI
- interface -> package (implementation)

Establish a connection

The first step to import data from a SQL database is creating a connection to it. As Filip explained, you need different packages depending on the database you want to connect to. All of these packages do this in a uniform way, as specified in the DBI package.

dbConnect() creates a connection between your R session and a SQL database. The first argument has to be a DBIdriver object, that specifies how connections are made and how data is mapped between R and the database. Specifically for MySQL databases, you can build such a driver with RMySQL::MySQL().

If the MySQL database is a remote database hosted on a server, you’ll also have to specify the following arguments in dbConnect(): dbname, host, port, user and password. Most of these details have already been provided.

# Load the DBI package
library(DBI)

# Edit dbConnect() call
con <- dbConnect(RMySQL::MySQL(), 
                 dbname = "tweater", 
                 host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com", 
                 port = 3306,
                 user = "student",
                 password = "datacamp")
con

## <MySQLConnection:0,0>

Awesome! You are now connected to the MySQL database tweater.

Import table data

List the database tables

After you’ve successfully connected to a remote MySQL database, the next step is to see what tables the database contains. You can do this with the dbListTables() function. As you might remember from the video, this function requires the connection object as an input, and outputs a character vector with the table names.

# Load the RMySQL package
library(DBI)

# Connect to the MySQL database: con
con <- dbConnect(RMySQL::MySQL(), 
                 dbname = "tweater", 
                 host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com", 
                 port = 3306,
                 user = "student",
                 password = "datacamp")

# Build a vector of table names: tables
tables <- dbListTables(con)

# Display structure of tables
str(tables)

##  chr [1:3] "comments" "tweats" "users"

Good! dbListTables() can be very useful to get a first idea about the contents of your database. Can you guess what kind of information this database contains?

Import users

As you might have guessed by now, the database contains data on a more tasty version of Twitter, namely Tweater. Users can post tweats with short recipes for delicious snacks. People can comment on these tweats. There are three tables: users, tweats, and comments that have relations among them. Which ones, you ask? You’ll discover in a moment!

Let’s start by importing the data on the users into your R session. You do this with the dbReadTable() function. Simply pass it the connection object (con), followed by the name of the table you want to import. The resulting object is a standard R data frame.

# Load the DBI package
library(DBI)

# Connect to the MySQL database: con
con <- dbConnect(RMySQL::MySQL(), 
                 dbname = "tweater", 
                 host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com", 
                 port = 3306,
                 user = "student",
                 password = "datacamp")

# Import the users table from tweater: users
users <- dbReadTable(con, "users")

# Print users
users

##   id      name     login
## 1  1 elisabeth  elismith
## 2  2      mike     mikey
## 3  3      thea   teatime
## 4  4    thomas tomatotom
## 5  5    oliver olivander
## 6  6      kate  katebenn
## 7  7    anjali    lianja

Import all tables

Next to the users, we’re also interested in the tweats and comments tables. However, separate dbReadTable() calls for each and every one of the tables in your database would mean a lot of code duplication. Remember about the lapply() function? You can use it again here! A connection is already coded for you, as well as a vector table_names, containing the names of all the tables in the database.

# Load the DBI package
library(DBI)

# Connect to the MySQL database: con
con <- dbConnect(RMySQL::MySQL(), 
                 dbname = "tweater", 
                 host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com", 
                 port = 3306,
                 user = "student",
                 password = "datacamp")

# Get table names
table_names <- dbListTables(con)

# Import all tables
tables <- lapply(table_names, dbReadTable, conn = con)

# Print out tables
tables

## [[1]]
##      id tweat_id user_id            message
## 1  1022       87       7              nice!
## 2  1000       77       7             great!
## 3  1011       49       5            love it
## 4  1012       87       1   awesome! thanks!
## 5  1010       88       6              yuck!
## 6  1026       77       4      not my thing!
## 7  1004       49       1  this is fabulous!
## 8  1030       75       6           so easy!
## 9  1025       88       2             oh yes
## 10 1007       49       3           serious?
## 11 1020       77       1 couldn't be better
## 12 1014       77       1       saved my day
## 
## [[2]]
##   id user_id
## 1 75       3
## 2 88       4
## 3 77       6
## 4 87       5
## 5 49       1
## 6 24       7
##                                                                  post
## 1                                       break egg. bake egg. eat egg.
## 2                           wash strawberries. add ice. blend. enjoy.
## 3                       2 slices of bread. add cheese. grill. heaven.
## 4               open and crush avocado. add shrimps. perfect starter.
## 5 nachos. add tomato sauce, minced meat and cheese. oven for 10 mins.
## 6                              just eat an apple. simply and healthy.
##         date
## 1 2015-09-05
## 2 2015-09-14
## 3 2015-09-21
## 4 2015-09-22
## 5 2015-09-22
## 6 2015-09-24
## 
## [[3]]
##   id      name     login
## 1  1 elisabeth  elismith
## 2  2      mike     mikey
## 3  3      thea   teatime
## 4  4    thomas tomatotom
## 5  5    oliver olivander
## 6  6      kate  katebenn
## 7  7    anjali    lianja

Wonderful! Now that you have an R version of all data that is contained in the database, you can dive a little deeper into the relations between the different data frames.

SQL Queries from inside R

Selective importing

SQL Queries
DBI -> RMySQL, RPostgreSQL
Just the basics of SQL

Query tweater (1)

In your life as a data scientist, you’ll often be working with huge databases that contain tables with millions of rows. If you want to do some analyses on this data, it’s possible that you only need a fraction of this data. In this case, it’s a good idea to send SQL queries to your database, and only import the data you actually need into R.

dbGetQuery() is what you need. As usual, you first pass the connection object to it. The second argument is an SQL query in the form of a character string. This example selects the age variable from the people dataset where gender equals "male":

dbGetQuery(con, "SELECT age FROM people WHERE gender = 'male'")

A connection to the tweater database has already been coded for you.

# Connect to the database
library(DBI)
con <- dbConnect(RMySQL::MySQL(),
                 dbname = "tweater",
                 host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
                 port = 3306,
                 user = "student",
                 password = "datacamp")

# Import tweat_id column of comments where user_id is 1: elisabeth
elisabeth <- dbGetQuery(con, "SELECT tweat_id FROM comments WHERE user_id = 1")

# Print  elisabeth
elisabeth

##   tweat_id
## 1       87
## 2       49
## 3       77
## 4       77

Nice! To make sure you understood SQL’s SELECT - FROM - WHERE syntax, let’s practice some more.

Query tweater (2)

Apart from checking equality, you can also check for less than and greater than relationships, with < and >, just like in R.

con, a connection to the tweater database, is again available.

# Connect to the database
library(DBI)
con <- dbConnect(RMySQL::MySQL(),
                 dbname = "tweater",
                 host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
                 port = 3306,
                 user = "student",
                 password = "datacamp")

# Import post column of tweats where date is higher than '2015-09-21': latest
latest <- dbGetQuery(con, "SELECT post FROM tweats WHERE date > '2015-09-21'")

# Print latest
latest

##                                                                  post
## 1               open and crush avocado. add shrimps. perfect starter.
## 2 nachos. add tomato sauce, minced meat and cheese. oven for 10 mins.
## 3                              just eat an apple. simply and healthy.

Nice! To make sure you understood SQL’s SELECT - FROM - WHERE syntax, let’s practice some more.

Query tweater (3)

Suppose that you have a people table, with a bunch of information. This time, you want to find out the age and country of married males. Provided that there is a married column that’s 1 when the person in question is married, the following query would work.

SELECT age, country
  FROM people
    WHERE gender = "male" AND married = 1

Can you use a similar approach for a more specialized query on the tweater database?

# Connect to the database
library(DBI)
con <- dbConnect(RMySQL::MySQL(),
                 dbname = "tweater",
                 host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
                 port = 3306,
                 user = "student",
                 password = "datacamp")

# Create data frame specific
specific <- dbGetQuery(con, "SELECT message FROM comments WHERE tweat_id = 77 AND user_id > 4")

# Print specific and short
specific

##   message
## 1  great!

Query tweater (4)

There are also dedicated SQL functions that you can use in the WHERE clause of an SQL query. For example, CHAR_LENGTH() returns the number of characters in a string.

# Connect to the database
library(DBI)
con <- dbConnect(RMySQL::MySQL(),
                 dbname = "tweater",
                 host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
                 port = 3306,
                 user = "student",
                 password = "datacamp")

# Create data frame short
short <- dbGetQuery(con, "SELECT id, name FROM users WHERE CHAR_LENGTH(name) < 5")

# Print short
short

##   id name
## 1  2 mike
## 2  3 thea
## 3  6 kate

Join the query madness!

Of course, SQL does not stop with the the three keywords SELECT, FROM and WHERE. Another very often used keyword is JOIN, and more specifically INNER JOIN. Take this call for example:

SELECT name, post
  FROM users INNER JOIN tweats on users.id = user_id
    WHERE date > "2015-09-19"

Here, the users table is joined with the tweats table. This is possible because the id column in the users table corresponds to the user_id column in the tweats table. Also notice how name, from the users table, and post and date, from the tweats table, can be referenced to without problems.

Can you predict the outcome of the following query?

SELECT post, message
  FROM tweats INNER JOIN comments on tweats.id = tweat_id
    WHERE tweat_id = 77

A connection to the tweater database is already available as con; feel free to experiment!

dbGetQuery(con, "SELECT post, message
                 FROM tweats INNER JOIN comments on tweats.id = tweat_id
                 WHERE tweat_id = 77")

##                                            post            message
## 1 2 slices of bread. add cheese. grill. heaven.             great!
## 2 2 slices of bread. add cheese. grill. heaven.      not my thing!
## 3 2 slices of bread. add cheese. grill. heaven. couldn't be better
## 4 2 slices of bread. add cheese. grill. heaven.       saved my day

Bellissimo! Have you tried running this query with dbGetQuery()? The result contains data both from the tweats and the comments column.

DBI internals

Send - Fetch - Clear

You’ve used dbGetQuery() multiple times now. This is a virtual function from the DBI package, but is actually implemented by the RMySQL package. Behind the scenes, the following steps are performed:

Sending the specified query with dbSendQuery();
Fetching the result of executing the query on the database with dbFetch();
Clearing the result with dbClearResult().

Let’s not use dbGetQuery() this time and implement the steps above. This is tedious to write, but it gives you the ability to fetch the query’s result in chunks rather than all at once. You can do this by specifying the n argument inside dbFetch().

# Connect to the database
library(DBI)
con <- dbConnect(RMySQL::MySQL(),
                 dbname = "tweater",
                 host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
                 port = 3306,
                 user = "student",
                 password = "datacamp")

# Send query to the database
res <- dbSendQuery(con, "SELECT * FROM comments WHERE user_id > 4")

# Use dbFetch() twice
dbFetch(res, n = 2)

##     id tweat_id user_id message
## 1 1022       87       7   nice!
## 2 1000       77       7  great!

dbFetch(res)

##     id tweat_id user_id  message
## 1 1011       49       5  love it
## 2 1010       88       6    yuck!
## 3 1030       75       6 so easy!

# Clear res
dbClearResult(res)

## [1] TRUE

Perfect! In our toy example, chopping up the fetches doesn’t make a lot of sense, but make sure to remember this technique when you’re struggling with huge databases!

Be polite and …

Every time you connect to a database using dbConnect(), you’re creating a new connection to the database you’re referencing. RMySQL automatically specifies a maximum of open connections and closes some of the connections for you, but still: it’s always polite to manually disconnect from the database afterwards. You do this with the dbDisconnect() function.

The code that connects you to the database is already available, can you finish the script?

# Connect to the database
library(DBI)
con <- dbConnect(RMySQL::MySQL(),
                 dbname = "tweater",
                 host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
                 port = 3306,
                 user = "student",
                 password = "datacamp")

# Create the data frame  long_tweats
long_tweats <- dbGetQuery(con, "SELECT post, date FROM tweats WHERE CHAR_LENGTH(post) > 40")

# Print long_tweats
long_tweats

##                                                                  post
## 1                           wash strawberries. add ice. blend. enjoy.
## 2                       2 slices of bread. add cheese. grill. heaven.
## 3               open and crush avocado. add shrimps. perfect starter.
## 4 nachos. add tomato sauce, minced meat and cheese. oven for 10 mins.
##         date
## 1 2015-09-14
## 2 2015-09-21
## 3 2015-09-22
## 4 2015-09-22

# Disconnect from the database
dbDisconnect(con)

## [1] TRUE

Wonderful! This concludes the chapter on databases. Of course, there is tons more to learn about interfacing to databases and working with them as efficiently as possible, but that’s something for more advanced courses.

HTTP

Data on the web

already worked on it!
many packages handle it for you
file formats useful for web technology
JSON

Import flat files from the web

In the video, you saw that the utils functions to import flat file data, such as read.csv() and read.delim(), are capable of automatically importing from URLs that point to flat files on the web.

You must be wondering whether Hadley Wickham’s alternative package, readr, is equally potent. Well, figure it out in this exercise! The URLs for both a .csv file as well as a .delim file are already coded for you. It’s up to you to actually import the data. If it works, that is…

# Load the readr package
library(readr)

# Import the csv file: pools
url_csv <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/swimming_pools.csv"
pools <- read_csv(url_csv)

## 
## -- Column specification --------------------------------------------------------
## cols(
##   Name = col_character(),
##   Address = col_character(),
##   Latitude = col_double(),
##   Longitude = col_double()
## )

# Import the txt file: potatoes
url_delim <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/potatoes.txt"
potatoes <- read_tsv(url_delim)

## 
## -- Column specification --------------------------------------------------------
## cols(
##   area = col_double(),
##   temp = col_double(),
##   size = col_double(),
##   storage = col_double(),
##   method = col_double(),
##   texture = col_double(),
##   flavor = col_double(),
##   moistness = col_double()
## )

# Print pools and potatoes
pools

## # A tibble: 20 x 4
##    Name                         Address                       Latitude Longitude
##    <chr>                        <chr>                            <dbl>     <dbl>
##  1 Acacia Ridge Leisure Centre  1391 Beaudesert Road, Acacia~    -27.6      153.
##  2 Bellbowrie Pool              Sugarwood Street, Bellbowrie     -27.6      153.
##  3 Carole Park                  Cnr Boundary Road and Waterf~    -27.6      153.
##  4 Centenary Pool (inner City)  400 Gregory Terrace, Spring ~    -27.5      153.
##  5 Chermside Pool               375 Hamilton Road, Chermside     -27.4      153.
##  6 Colmslie Pool (Morningside)  400 Lytton Road, Morningside     -27.5      153.
##  7 Spring Hill Baths (inner Ci~ 14 Torrington Street, Spring~    -27.5      153.
##  8 Dunlop Park Pool (Corinda)   794 Oxley Road, Corinda          -27.5      153.
##  9 Fortitude Valley Pool        432 Wickham Street, Fortitud~    -27.5      153.
## 10 Hibiscus Sports Complex (up~ 90 Klumpp Road, Upper Mount ~    -27.6      153.
## 11 Ithaca Pool ( Paddington)    131 Caxton Street, Paddington    -27.5      153.
## 12 Jindalee Pool                11 Yallambee Road, Jindalee      -27.5      153.
## 13 Manly Pool                   1 Fairlead Crescent, Manly       -27.5      153.
## 14 Mt Gravatt East Aquatic Cen~ Cnr wecker Road and Newnham ~    -27.5      153.
## 15 Musgrave Park Pool (South B~ 100 Edmonstone Street, South~    -27.5      153.
## 16 Newmarket Pool               71 Alderson Stret, Newmarket     -27.4      153.
## 17 Runcorn Pool                 37 Bonemill Road, Runcorn        -27.6      153.
## 18 Sandgate Pool                231 Flinders Parade, Sandgate    -27.3      153.
## 19 Langlands Parks Pool (Stone~ 5 Panitya Street, Stones Cor~    -27.5      153.
## 20 Yeronga Park Pool            81 School Road, Yeronga          -27.5      153.

potatoes

## # A tibble: 160 x 8
##     area  temp  size storage method texture flavor moistness
##    <dbl> <dbl> <dbl>   <dbl>  <dbl>   <dbl>  <dbl>     <dbl>
##  1     1     1     1       1      1     2.9    3.2       3  
##  2     1     1     1       1      2     2.3    2.5       2.6
##  3     1     1     1       1      3     2.5    2.8       2.8
##  4     1     1     1       1      4     2.1    2.9       2.4
##  5     1     1     1       1      5     1.9    2.8       2.2
##  6     1     1     1       2      1     1.8    3         1.7
##  7     1     1     1       2      2     2.6    3.1       2.4
##  8     1     1     1       2      3     3      3         2.9
##  9     1     1     1       2      4     2.2    3.2       2.5
## 10     1     1     1       2      5     2      2.8       1.9
## # ... with 150 more rows

Great! It seems to work without any additional problems!

Secure importing

In the previous exercises, you have been working with URLs that all start with http://. There is, however, a safer alternative to HTTP, namely HTTPS, which stands for HypterText Transfer Protocol Secure. Just remember this: HTTPS is relatively safe, HTTP is not.

Luckily for us, you can use the standard importing functions with https:// connections since R version 3.2.2.

# https URL to the swimming_pools csv file.
url_csv <- "https://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/swimming_pools.csv"

# Import the file using read.csv(): pools1
pools1 <- read.csv(url_csv)

# Load the readr package
library(readr)

# Import the file using read_csv(): pools2
pools2 <- read_csv(url_csv)

## 
## -- Column specification --------------------------------------------------------
## cols(
##   Name = col_character(),
##   Address = col_character(),
##   Latitude = col_double(),
##   Longitude = col_double()
## )

# Print the structure of pools1 and pools2
str(pools1)

## 'data.frame':    20 obs. of  4 variables:
##  $ Name     : chr  "Acacia Ridge Leisure Centre" "Bellbowrie Pool" "Carole Park" "Centenary Pool (inner City)" ...
##  $ Address  : chr  "1391 Beaudesert Road, Acacia Ridge" "Sugarwood Street, Bellbowrie" "Cnr Boundary Road and Waterford Road Wacol" "400 Gregory Terrace, Spring Hill" ...
##  $ Latitude : num  -27.6 -27.6 -27.6 -27.5 -27.4 ...
##  $ Longitude: num  153 153 153 153 153 ...

str(pools2)

## tibble [20 x 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Name     : chr [1:20] "Acacia Ridge Leisure Centre" "Bellbowrie Pool" "Carole Park" "Centenary Pool (inner City)" ...
##  $ Address  : chr [1:20] "1391 Beaudesert Road, Acacia Ridge" "Sugarwood Street, Bellbowrie" "Cnr Boundary Road and Waterford Road Wacol" "400 Gregory Terrace, Spring Hill" ...
##  $ Latitude : num [1:20] -27.6 -27.6 -27.6 -27.5 -27.4 ...
##  $ Longitude: num [1:20] 153 153 153 153 153 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Name = col_character(),
##   ..   Address = col_character(),
##   ..   Latitude = col_double(),
##   ..   Longitude = col_double()
##   .. )

Great! It seems to work without any additional problems!

Downloading files

Why download.file()?

Reproducibility
HTTP from inside R
- authentication
- additional parameters
- httr - Hadley Wickham

Import Excel files from the web

When you learned about gdata, it was already mentioned that gdata can handle .xls files that are on the internet. readxl can’t, at least not yet. The URL with which you’ll be working is already available in the sample code. You will import it once using gdata and once with the readxl package via a workaround.

# Load the readxl and gdata package
library(readxl)
library(gdata)

## gdata: read.xls support for 'XLS' (Excel 97-2004) files ENABLED.

##

## gdata: read.xls support for 'XLSX' (Excel 2007+) files ENABLED.

## 
## Attaching package: 'gdata'

## The following object is masked from 'package:stats':
## 
##     nobs

## The following object is masked from 'package:utils':
## 
##     object.size

## The following object is masked from 'package:base':
## 
##     startsWith

# Specification of url: url_xls
url_xls <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/latitude.xls"

# Import the .xls file with gdata: excel_gdata
excel_gdata <- read.xls(url_xls)

# Download file behind URL, name it local_latitude.xls
download.file(url_xls, destfile = "_data/local_latitude.xls", quiet = TRUE, mode = "wb")

# Import the local .xls file with readxl: excel_readxl
excel_readxl <- read_excel("_data/local_latitude.xls")

Nice! It appears that readxl is not (yet?) able to deal with Excel files that are on the web. However, a simply workaround with download.file() fixes this.

Downloading any file, secure or not

In the previous exercise you’ve seen how you can read excel files on the web using the read_excel package by first downloading the file with the download.file() function.

There’s more: with download.file() you can download any kind of file from the web, using HTTP and HTTPS: images, executable files, but also .RData files. An RData file is very efficient format to store R data.

You can load data from an RData file using the load() function, but this function does not accept a URL string as an argument. In this exercise, you’ll first download the RData file securely, and then import the local data file.

# https URL to the wine RData file.
url_rdata <- "https://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/wine.RData"

# Download the wine file to your working directory
download.file(url_rdata, destfile = "_data/wine_local.RData")

# Load the wine data into your workspace using load()
load("_data/wine_local.RData")

# Print out the summary of the wine data
summary(wine)

##     Alcohol        Malic acid        Ash        Alcalinity of ash
##  Min.   :11.03   Min.   :0.74   Min.   :1.360   Min.   :10.60    
##  1st Qu.:12.36   1st Qu.:1.60   1st Qu.:2.210   1st Qu.:17.20    
##  Median :13.05   Median :1.87   Median :2.360   Median :19.50    
##  Mean   :12.99   Mean   :2.34   Mean   :2.366   Mean   :19.52    
##  3rd Qu.:13.67   3rd Qu.:3.10   3rd Qu.:2.560   3rd Qu.:21.50    
##  Max.   :14.83   Max.   :5.80   Max.   :3.230   Max.   :30.00    
##    Magnesium      Total phenols     Flavanoids    Nonflavanoid phenols
##  Min.   : 70.00   Min.   :0.980   Min.   :0.340   Min.   :0.1300      
##  1st Qu.: 88.00   1st Qu.:1.740   1st Qu.:1.200   1st Qu.:0.2700      
##  Median : 98.00   Median :2.350   Median :2.130   Median :0.3400      
##  Mean   : 99.59   Mean   :2.292   Mean   :2.023   Mean   :0.3623      
##  3rd Qu.:107.00   3rd Qu.:2.800   3rd Qu.:2.860   3rd Qu.:0.4400      
##  Max.   :162.00   Max.   :3.880   Max.   :5.080   Max.   :0.6600      
##  Proanthocyanins Color intensity       Hue           Proline      
##  Min.   :0.410   Min.   : 1.280   Min.   :1.270   Min.   : 278.0  
##  1st Qu.:1.250   1st Qu.: 3.210   1st Qu.:1.930   1st Qu.: 500.0  
##  Median :1.550   Median : 4.680   Median :2.780   Median : 672.0  
##  Mean   :1.587   Mean   : 5.055   Mean   :2.604   Mean   : 745.1  
##  3rd Qu.:1.950   3rd Qu.: 6.200   3rd Qu.:3.170   3rd Qu.: 985.0  
##  Max.   :3.580   Max.   :13.000   Max.   :4.000   Max.   :1680.0

Great! Another way to load remote RData files is to use the url() function inside load(). However, this will not save the RData file to a local file.

HTTP? httr! (1)

Downloading a file from the Internet means sending a GET request and receiving the file you asked for. Internally, all the previously discussed functions use a GET request to download files.

httr provides a convenient function, GET() to execute this GET request. The result is a response object, that provides easy access to the status code, content-type and, of course, the actual content.

You can extract the content from the request using the content() function. At the time of writing, there are three ways to retrieve this content: as a raw object, as a character vector, or an R object, such as a list. If you don’t tell content() how to retrieve the content through the as argument, it’ll try its best to figure out which type is most appropriate based on the content-type.

# Load the httr package
library(httr)

# Get the url, save response to resp
url <- "http://www.example.com/"
resp <- GET(url)

# Print resp
resp

## Response [http://www.example.com/]
##   Date: 2020-12-11 20:52
##   Status: 200
##   Content-Type: text/html; charset=UTF-8
##   Size: 1.26 kB
## <!doctype html>
## <html>
## <head>
##     <title>Example Domain</title>
## 
##     <meta charset="utf-8" />
##     <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
##     <meta name="viewport" content="width=device-width, initial-scale=1" />
##     <style type="text/css">
##     body {
## ...

# Get the raw content of resp: raw_content
raw_content <- content(resp, as = "raw")

# Print the head of raw_content
head(raw_content)

## [1] 3c 21 64 6f 63 74

Great! The raw content of the response doesn’t make a lot of sense, does it? Luckily, the content() function by default, if you don’t specify the as argument, figures out what type of data you’re dealing with and parses it for you.

HTTP? httr! (2)

Web content does not limit itself to HTML pages and files stored on remote servers such as DataCamp’s Amazon S3 instances. There are many other data formats out there. A very common one is JSON. This format is very often used by so-called Web APIs, interfaces to web servers with which you as a client can communicate to get or store information in more complicated ways.

You’ll learn about Web APIs and JSON in the video and exercises that follow, but some experimentation never hurts, does it?

# httr is already loaded

# Get the url
url <- "http://www.omdbapi.com/?apikey=72bc447a&t=Annie+Hall&y=&plot=short&r=json"
resp <- GET(url)

# Print resp
resp

## Response [http://www.omdbapi.com/?apikey=72bc447a&t=Annie+Hall&y=&plot=short&r=json]
##   Date: 2020-12-11 20:52
##   Status: 200
##   Content-Type: application/json; charset=utf-8
##   Size: 932 B

# Print content of resp as text
content(resp, as = "text")

## [1] "{\"Title\":\"Annie Hall\",\"Year\":\"1977\",\"Rated\":\"PG\",\"Released\":\"20 Apr 1977\",\"Runtime\":\"93 min\",\"Genre\":\"Comedy, Romance\",\"Director\":\"Woody Allen\",\"Writer\":\"Woody Allen, Marshall Brickman\",\"Actors\":\"Woody Allen, Diane Keaton, Tony Roberts, Carol Kane\",\"Plot\":\"Neurotic New York comedian Alvy Singer falls in love with the ditzy Annie Hall.\",\"Language\":\"English, German\",\"Country\":\"USA\",\"Awards\":\"Won 4 Oscars. Another 26 wins & 8 nominations.\",\"Poster\":\"https://m.media-amazon.com/images/M/MV5BZDg1OGQ4YzgtM2Y2NS00NjA3LWFjYTctMDRlMDI3NWE1OTUyXkEyXkFqcGdeQXVyMjUzOTY1NTc@._V1_SX300.jpg\",\"Ratings\":[{\"Source\":\"Internet Movie Database\",\"Value\":\"8.0/10\"},{\"Source\":\"Rotten Tomatoes\",\"Value\":\"98%\"},{\"Source\":\"Metacritic\",\"Value\":\"92/100\"}],\"Metascore\":\"92\",\"imdbRating\":\"8.0\",\"imdbVotes\":\"249,146\",\"imdbID\":\"tt0075686\",\"Type\":\"movie\",\"DVD\":\"N/A\",\"BoxOffice\":\"N/A\",\"Production\":\"Rollins-Joffe Productions\",\"Website\":\"N/A\",\"Response\":\"True\"}"

# Print content of resp
content(resp)

## $Title
## [1] "Annie Hall"
## 
## $Year
## [1] "1977"
## 
## $Rated
## [1] "PG"
## 
## $Released
## [1] "20 Apr 1977"
## 
## $Runtime
## [1] "93 min"
## 
## $Genre
## [1] "Comedy, Romance"
## 
## $Director
## [1] "Woody Allen"
## 
## $Writer
## [1] "Woody Allen, Marshall Brickman"
## 
## $Actors
## [1] "Woody Allen, Diane Keaton, Tony Roberts, Carol Kane"
## 
## $Plot
## [1] "Neurotic New York comedian Alvy Singer falls in love with the ditzy Annie Hall."
## 
## $Language
## [1] "English, German"
## 
## $Country
## [1] "USA"
## 
## $Awards
## [1] "Won 4 Oscars. Another 26 wins & 8 nominations."
## 
## $Poster
## [1] "https://m.media-amazon.com/images/M/MV5BZDg1OGQ4YzgtM2Y2NS00NjA3LWFjYTctMDRlMDI3NWE1OTUyXkEyXkFqcGdeQXVyMjUzOTY1NTc@._V1_SX300.jpg"
## 
## $Ratings
## $Ratings[[1]]
## $Ratings[[1]]$Source
## [1] "Internet Movie Database"
## 
## $Ratings[[1]]$Value
## [1] "8.0/10"
## 
## 
## $Ratings[[2]]
## $Ratings[[2]]$Source
## [1] "Rotten Tomatoes"
## 
## $Ratings[[2]]$Value
## [1] "98%"
## 
## 
## $Ratings[[3]]
## $Ratings[[3]]$Source
## [1] "Metacritic"
## 
## $Ratings[[3]]$Value
## [1] "92/100"
## 
## 
## 
## $Metascore
## [1] "92"
## 
## $imdbRating
## [1] "8.0"
## 
## $imdbVotes
## [1] "249,146"
## 
## $imdbID
## [1] "tt0075686"
## 
## $Type
## [1] "movie"
## 
## $DVD
## [1] "N/A"
## 
## $BoxOffice
## [1] "N/A"
## 
## $Production
## [1] "Rollins-Joffe Productions"
## 
## $Website
## [1] "N/A"
## 
## $Response
## [1] "True"

Great! The fact that httr converts the JSON response body automatically to an R list is very convenient.

APIs & JSON

Other data formats

before: pages and files from the web
JSON
simple, concise, well-structured
human readable
easy to parse amd generate for computers
for communication with web APIs

API

Application Programming Interface
set of routines and protocols for building software
how different components interact
web API
- interface to get or add data to a server

jsonlite

Jeroen Ooms
improvement of earlier packages
consistent, robust
support all use cases

From JSON to R

In the simplest setting, fromJSON() can convert character strings that represent JSON data into a nicely structured R list. Give it a try!

# Load the jsonlite package
library(jsonlite)

# wine_json is a JSON
wine_json <- '{"name":"Chateau Migraine", "year":1997, "alcohol_pct":12.4, "color":"red", "awarded":false}'

# Convert wine_json into a list: wine
wine <- fromJSON(wine_json)

# Print structure of wine
str(wine)

## List of 5
##  $ name       : chr "Chateau Migraine"
##  $ year       : int 1997
##  $ alcohol_pct: num 12.4
##  $ color      : chr "red"
##  $ awarded    : logi FALSE

Quandl API

As Filip showed in the video, fromJSON() also works if you pass a URL as a character string or the path to a local file that contains JSON data. Let’s try this out on the Quandl API, where you can fetch all sorts of financial and economical data.

# jsonlite is preloaded

# Definition of quandl_url
quandl_url <- "https://www.quandl.com/api/v3/datasets/WIKI/FB/data.json?auth_token=i83asDsiWUUyfoypkgMz"

# Import Quandl data: quandl_data
quandl_data <- fromJSON(quandl_url)

# Print structure of quandl_data
str(quandl_data)

## List of 1
##  $ dataset_data:List of 10
##   ..$ limit       : NULL
##   ..$ transform   : NULL
##   ..$ column_index: NULL
##   ..$ column_names: chr [1:13] "Date" "Open" "High" "Low" ...
##   ..$ start_date  : chr "2012-05-18"
##   ..$ end_date    : chr "2018-03-27"
##   ..$ frequency   : chr "daily"
##   ..$ data        : chr [1:1472, 1:13] "2018-03-27" "2018-03-26" "2018-03-23" "2018-03-22" ...
##   ..$ collapse    : NULL
##   ..$ order       : NULL

Great! You successfully imported JSON data directly from the web. If you have a close look at the structure of quandl_data, you’ll see that the data element is a matrix.

OMDb API

In the video, you saw how easy it is to interact with an API once you know how to formulate requests. You also saw how to fetch all information on Rain Man from OMDb. Simply perform a GET() call, and next ask for the contents with the content() function. This content() function, which is part of the httr package, uses jsonlite behind the scenes to import the JSON data into R.

However, by now you also know that jsonlite can handle URLs itself. Simply passing the request URL to fromJSON() will get your data into R. In this exercise, you will be using this technique to compare the release year of two movies in the Open Movie Database.

# The package jsonlite is already loaded

# Definition of the URLs
url_sw4 <- "http://www.omdbapi.com/?apikey=72bc447a&i=tt0076759&r=json"
url_sw3 <- "http://www.omdbapi.com/?apikey=72bc447a&i=tt0121766&r=json"

# Import two URLs with fromJSON(): sw4 and sw3
sw4 <- fromJSON(url_sw4)
sw3 <- fromJSON(url_sw3)

# Print out the Title element of both lists
sw4$Title

## [1] "Star Wars: Episode IV - A New Hope"

sw3$Title

## [1] "Star Wars: Episode III - Revenge of the Sith"

# Is the release year of sw4 later than sw3?
sw4$Year > sw3$Year

## [1] FALSE

Well done! The fourth episode of the Star Wars saga was released before the third one! Enough of the fromJSON() now. Let’s try to convert some R data into JSON format now.

JSON & jsonlite

Other jsonlite functions

toJSON()
prettify()
minify()

JSON practice (1) JSON is built on two structures: objects and arrays. To help you experiment with these, two JSON strings are included in the sample code. It’s up to you to change them appropriately and then call jsonlite’s fromJSON() function on them each time.

# jsonlite is already loaded

# Challenge 1
json1 <- '[1, 2, 3, 4, 5, 6]'
fromJSON(json1)

## [1] 1 2 3 4 5 6

# Challenge 2
json2 <- '{"a": [1, 2, 3], "b": [4, 5, 6]}'
fromJSON(json2)

## $a
## [1] 1 2 3
## 
## $b
## [1] 4 5 6

JSON practice (2)

We prepared two more JSON strings in the sample code. Can you change them and call jsonlite’s fromJSON() function on them, similar to the previous exercise?

# jsonlite is already loaded

# Challenge 1
json1 <- '[[1, 2], [3, 4]]'
fromJSON(json1)

##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4

# Challenge 2
json2 <- '[{"a": 1, "b": 2}, {"a": 3, "b": 4}, {"a": 5, "b": 6}]'
fromJSON(json2)

##   a b
## 1 1 2
## 2 3 4
## 3 5 6

Great! As you can see different JSON data structures will lead to different data structures in R.

toJSON()

Apart from converting JSON to R with fromJSON(), you can also use toJSON() to convert R data to a JSON format. In its most basic use, you simply pass this function an R object to convert to a JSON. The result is an R object of the class json, which is basically a character string representing that JSON.

For this exercise, you will be working with a .csv file containing information on the amount of desalinated water that is produced around the world. As you’ll see, it contains a lot of missing values. This data can be found on the URL that is specified in the sample code.

# jsonlite is already loaded

# URL pointing to the .csv file
url_csv <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/water.csv"

# Import the .csv file located at url_csv
water <- read.csv(url_csv, stringsAsFactors = FALSE)

# Convert the data file according to the requirements
water_json <- toJSON(water)

# Print out water_json
water_json

## [{"water":"Algeria","X1992":0.064,"X2002":0.017},{"water":"American Samoa"},{"water":"Angola","X1992":0.0001,"X2002":0.0001},{"water":"Antigua and Barbuda","X1992":0.0033},{"water":"Argentina","X1992":0.0007,"X1997":0.0007,"X2002":0.0007},{"water":"Australia","X1992":0.0298,"X2002":0.0298},{"water":"Austria","X1992":0.0022,"X2002":0.0022},{"water":"Bahamas","X1992":0.0013,"X2002":0.0074},{"water":"Bahrain","X1992":0.0441,"X2002":0.0441,"X2007":0.1024},{"water":"Barbados","X2007":0.0146},{"water":"British Virgin Islands","X2007":0.0042},{"water":"Canada","X1992":0.0027,"X2002":0.0027},{"water":"Cape Verde","X1992":0.002,"X1997":0.0017},{"water":"Cayman Islands","X1992":0.0033},{"water":"Central African Rep."},{"water":"Chile","X1992":0.0048,"X2002":0.0048},{"water":"Colombia","X1992":0.0027,"X2002":0.0027},{"water":"Cuba","X1992":0.0069,"X1997":0.0069,"X2002":0.0069},{"water":"Cyprus","X1992":0.003,"X1997":0.003,"X2002":0.0335},{"water":"Czech Rep.","X1992":0.0002,"X2002":0.0002},{"water":"Denmark","X1992":0.015,"X2002":0.015},{"water":"Djibouti","X1992":0.0001,"X2002":0.0001},{"water":"Ecuador","X1992":0.0022,"X1997":0.0022,"X2002":0.0022},{"water":"Egypt","X1992":0.025,"X1997":0.025,"X2002":0.1},{"water":"El Salvador","X1992":0.0001,"X2002":0.0001},{"water":"Finland","X1992":0.0001,"X2002":0.0001},{"water":"France","X1992":0.0117,"X2002":0.0117},{"water":"Gibraltar","X1992":0.0077},{"water":"Greece","X1992":0.01,"X2002":0.01},{"water":"Honduras","X1992":0.0002,"X2002":0.0002},{"water":"Hungary","X1992":0.0002,"X2002":0.0002},{"water":"India","X1997":0.0005,"X2002":0.0005},{"water":"Indonesia","X1992":0.0187,"X2002":0.0187},{"water":"Iran","X1992":0.003,"X1997":0.003,"X2002":0.003,"X2007":0.2},{"water":"Iraq","X1997":0.0074,"X2002":0.0074},{"water":"Ireland","X1992":0.0002,"X2002":0.0002},{"water":"Israel","X1992":0.0256,"X2002":0.0256,"X2007":0.14},{"water":"Italy","X1992":0.0973,"X2002":0.0973},{"water":"Jamaica","X1992":0.0005,"X1997":0.0005,"X2002":0.0005},{"water":"Japan","X1997":0.04,"X2002":0.04},{"water":"Jordan","X1997":0.002,"X2007":0.0098},{"water":"Kazakhstan","X1997":1.328,"X2002":1.328},{"water":"Kuwait","X1992":0.507,"X1997":0.231,"X2002":0.4202},{"water":"Lebanon","X2007":0.0473},{"water":"Libya","X2002":0.018},{"water":"Malaysia","X1992":0.0043,"X2002":0.0043},{"water":"Maldives","X1992":0.0004},{"water":"Malta","X1992":0.024,"X1997":0.031,"X2002":0.031},{"water":"Marshall Islands","X1992":0.0007},{"water":"Mauritania","X1992":0.002,"X2002":0.002},{"water":"Mexico","X1992":0.0307,"X2002":0.0307},{"water":"Morocco","X1992":0.0034,"X1997":0.0034,"X2002":0.007},{"water":"Namibia","X1992":0.0003,"X2002":0.0003},{"water":"Netherlands Antilles","X1992":0.063},{"water":"Nicaragua","X1992":0.0002,"X2002":0.0002},{"water":"Nigeria","X1992":0.003,"X2002":0.003},{"water":"Norway","X1992":0.0001,"X2002":0.0001},{"water":"Oman","X1997":0.034,"X2002":0.034,"X2007":0.109},{"water":"Peru","X1992":0.0054,"X2002":0.0054},{"water":"Poland","X1992":0.007,"X2002":0.007},{"water":"Portugal","X1992":0.0016,"X2002":0.0016},{"water":"Qatar","X1992":0.065,"X1997":0.099,"X2002":0.099,"X2007":0.18},{"water":"Saudi Arabia","X1992":0.683,"X1997":0.727,"X2002":0.863,"X2007":1.033},{"water":"Senegal","X1992":0,"X2002":0},{"water":"Somalia","X1992":0.0001,"X2002":0.0001},{"water":"South Africa","X1992":0.018,"X2002":0.018},{"water":"Spain","X1992":0.1002,"X2002":0.1002},{"water":"Sudan","X1992":0.0004,"X1997":0.0004,"X2002":0.0004},{"water":"Sweden","X1992":0.0002,"X2002":0.0002},{"water":"Trinidad and Tobago","X2007":0.036},{"water":"Tunisia","X1992":0.008,"X2002":0.013},{"water":"Turkey","X1992":0.0005,"X2002":0.0005,"X2007":0.0005},{"water":"United Arab Emirates","X1992":0.163,"X1997":0.385,"X2007":0.95},{"water":"United Kingdom","X1992":0.0333,"X2002":0.0333},{"water":"United States","X1992":0.58,"X2002":0.58},{"water":"Venezuela","X1992":0.0052,"X2002":0.0052},{"water":"Yemen, Rep.","X1992":0.01,"X2002":0.01}]

Congratulations! As you can see, the JSON you printed out isn’t easy to read. In the next exercise, you will print out some more JSONs, and explore ways to prettify or minify the output.

Minify and prettify

JSONs can come in different formats. Take these two JSONs, that are in fact exactly the same: the first one is in a minified format, the second one is in a pretty format with indentation, whitespace and new lines:

# Mini
{"a":1,"b":2,"c":{"x":5,"y":6}}

# Pretty
{
  "a": 1,
  "b": 2,
  "c": {
    "x": 5,
    "y": 6
  }
}

Unless you’re a computer, you surely prefer the second version. However, the standard form that toJSON() returns, is the minified version, as it is more concise. You can adapt this behavior by setting the pretty argument inside toJSON() to TRUE. If you already have a JSON string, you can use prettify() or minify() to make the JSON pretty or as concise as possible.

# jsonlite is already loaded

# Convert mtcars to a pretty JSON: pretty_json
pretty_json <- toJSON(mtcars, pretty = TRUE)

# Print pretty_json
#pretty_json

# Minify pretty_json: mini_json
mini_json <- minify(pretty_json)

# Print mini_json
mini_json

## [{"mpg":21,"cyl":6,"disp":160,"hp":110,"drat":3.9,"wt":2.62,"qsec":16.46,"vs":0,"am":1,"gear":4,"carb":4,"_row":"Mazda RX4"},{"mpg":21,"cyl":6,"disp":160,"hp":110,"drat":3.9,"wt":2.875,"qsec":17.02,"vs":0,"am":1,"gear":4,"carb":4,"_row":"Mazda RX4 Wag"},{"mpg":22.8,"cyl":4,"disp":108,"hp":93,"drat":3.85,"wt":2.32,"qsec":18.61,"vs":1,"am":1,"gear":4,"carb":1,"_row":"Datsun 710"},{"mpg":21.4,"cyl":6,"disp":258,"hp":110,"drat":3.08,"wt":3.215,"qsec":19.44,"vs":1,"am":0,"gear":3,"carb":1,"_row":"Hornet 4 Drive"},{"mpg":18.7,"cyl":8,"disp":360,"hp":175,"drat":3.15,"wt":3.44,"qsec":17.02,"vs":0,"am":0,"gear":3,"carb":2,"_row":"Hornet Sportabout"},{"mpg":18.1,"cyl":6,"disp":225,"hp":105,"drat":2.76,"wt":3.46,"qsec":20.22,"vs":1,"am":0,"gear":3,"carb":1,"_row":"Valiant"},{"mpg":14.3,"cyl":8,"disp":360,"hp":245,"drat":3.21,"wt":3.57,"qsec":15.84,"vs":0,"am":0,"gear":3,"carb":4,"_row":"Duster 360"},{"mpg":24.4,"cyl":4,"disp":146.7,"hp":62,"drat":3.69,"wt":3.19,"qsec":20,"vs":1,"am":0,"gear":4,"carb":2,"_row":"Merc 240D"},{"mpg":22.8,"cyl":4,"disp":140.8,"hp":95,"drat":3.92,"wt":3.15,"qsec":22.9,"vs":1,"am":0,"gear":4,"carb":2,"_row":"Merc 230"},{"mpg":19.2,"cyl":6,"disp":167.6,"hp":123,"drat":3.92,"wt":3.44,"qsec":18.3,"vs":1,"am":0,"gear":4,"carb":4,"_row":"Merc 280"},{"mpg":17.8,"cyl":6,"disp":167.6,"hp":123,"drat":3.92,"wt":3.44,"qsec":18.9,"vs":1,"am":0,"gear":4,"carb":4,"_row":"Merc 280C"},{"mpg":16.4,"cyl":8,"disp":275.8,"hp":180,"drat":3.07,"wt":4.07,"qsec":17.4,"vs":0,"am":0,"gear":3,"carb":3,"_row":"Merc 450SE"},{"mpg":17.3,"cyl":8,"disp":275.8,"hp":180,"drat":3.07,"wt":3.73,"qsec":17.6,"vs":0,"am":0,"gear":3,"carb":3,"_row":"Merc 450SL"},{"mpg":15.2,"cyl":8,"disp":275.8,"hp":180,"drat":3.07,"wt":3.78,"qsec":18,"vs":0,"am":0,"gear":3,"carb":3,"_row":"Merc 450SLC"},{"mpg":10.4,"cyl":8,"disp":472,"hp":205,"drat":2.93,"wt":5.25,"qsec":17.98,"vs":0,"am":0,"gear":3,"carb":4,"_row":"Cadillac Fleetwood"},{"mpg":10.4,"cyl":8,"disp":460,"hp":215,"drat":3,"wt":5.424,"qsec":17.82,"vs":0,"am":0,"gear":3,"carb":4,"_row":"Lincoln Continental"},{"mpg":14.7,"cyl":8,"disp":440,"hp":230,"drat":3.23,"wt":5.345,"qsec":17.42,"vs":0,"am":0,"gear":3,"carb":4,"_row":"Chrysler Imperial"},{"mpg":32.4,"cyl":4,"disp":78.7,"hp":66,"drat":4.08,"wt":2.2,"qsec":19.47,"vs":1,"am":1,"gear":4,"carb":1,"_row":"Fiat 128"},{"mpg":30.4,"cyl":4,"disp":75.7,"hp":52,"drat":4.93,"wt":1.615,"qsec":18.52,"vs":1,"am":1,"gear":4,"carb":2,"_row":"Honda Civic"},{"mpg":33.9,"cyl":4,"disp":71.1,"hp":65,"drat":4.22,"wt":1.835,"qsec":19.9,"vs":1,"am":1,"gear":4,"carb":1,"_row":"Toyota Corolla"},{"mpg":21.5,"cyl":4,"disp":120.1,"hp":97,"drat":3.7,"wt":2.465,"qsec":20.01,"vs":1,"am":0,"gear":3,"carb":1,"_row":"Toyota Corona"},{"mpg":15.5,"cyl":8,"disp":318,"hp":150,"drat":2.76,"wt":3.52,"qsec":16.87,"vs":0,"am":0,"gear":3,"carb":2,"_row":"Dodge Challenger"},{"mpg":15.2,"cyl":8,"disp":304,"hp":150,"drat":3.15,"wt":3.435,"qsec":17.3,"vs":0,"am":0,"gear":3,"carb":2,"_row":"AMC Javelin"},{"mpg":13.3,"cyl":8,"disp":350,"hp":245,"drat":3.73,"wt":3.84,"qsec":15.41,"vs":0,"am":0,"gear":3,"carb":4,"_row":"Camaro Z28"},{"mpg":19.2,"cyl":8,"disp":400,"hp":175,"drat":3.08,"wt":3.845,"qsec":17.05,"vs":0,"am":0,"gear":3,"carb":2,"_row":"Pontiac Firebird"},{"mpg":27.3,"cyl":4,"disp":79,"hp":66,"drat":4.08,"wt":1.935,"qsec":18.9,"vs":1,"am":1,"gear":4,"carb":1,"_row":"Fiat X1-9"},{"mpg":26,"cyl":4,"disp":120.3,"hp":91,"drat":4.43,"wt":2.14,"qsec":16.7,"vs":0,"am":1,"gear":5,"carb":2,"_row":"Porsche 914-2"},{"mpg":30.4,"cyl":4,"disp":95.1,"hp":113,"drat":3.77,"wt":1.513,"qsec":16.9,"vs":1,"am":1,"gear":5,"carb":2,"_row":"Lotus Europa"},{"mpg":15.8,"cyl":8,"disp":351,"hp":264,"drat":4.22,"wt":3.17,"qsec":14.5,"vs":0,"am":1,"gear":5,"carb":4,"_row":"Ford Pantera L"},{"mpg":19.7,"cyl":6,"disp":145,"hp":175,"drat":3.62,"wt":2.77,"qsec":15.5,"vs":0,"am":1,"gear":5,"carb":6,"_row":"Ferrari Dino"},{"mpg":15,"cyl":8,"disp":301,"hp":335,"drat":3.54,"wt":3.57,"qsec":14.6,"vs":0,"am":1,"gear":5,"carb":8,"_row":"Maserati Bora"},{"mpg":21.4,"cyl":4,"disp":121,"hp":109,"drat":4.11,"wt":2.78,"qsec":18.6,"vs":1,"am":1,"gear":4,"carb":2,"_row":"Volvo 142E"}]

Great! Hopefully you agree that the pretty format is way easier to read and understand than the minified format! This exercise concludes the course on importing data!

haven

R packages to import data

haven
- Hadley Wickham
- goal: consistent, easy, fast
foreign
- R core team
- support for many data formats

haven

SAS, STATA, and SPSS
ReadStat: C library by Evan Miller
extremely simple to use
single argument: path to file
result: R data frame

Import SAS data with haven

haven is an extremely easy-to-use package to import data from three software packages: SAS, STATA and SPSS. Depending on the software, you use different functions:

SAS: read_sas()
STATA: read_dta() (or read_stata(), which are identical)
SPSS: read_sav() or read_por(), depending on the file type.

All these functions take one key argument: the path to your local file. In fact, you can even pass a URL; haven will then automatically download the file for you before importing it.

You’ll be working with data on the age, gender, income, and purchase level (0 = low, 1 = high) of 36 individuals (Source: SAS). The information is stored in a SAS file, sales.sas7bdat, which is available in your current working directory. You can also download the data here.

# Load the haven package
library(haven)

# Import sales.sas7bdat: sales
sales <- read_sas("_data/sales.sas7bdat")

# Display the structure of sales
str(sales)

## tibble [431 x 4] (S3: tbl_df/tbl/data.frame)
##  $ purchase: num [1:431] 0 0 1 1 0 0 0 0 0 0 ...
##  $ age     : num [1:431] 41 47 41 39 32 32 33 45 43 40 ...
##  $ gender  : chr [1:431] "Female" "Female" "Female" "Female" ...
##  $ income  : chr [1:431] "Low" "Low" "Low" "Low" ...
##  - attr(*, "label")= chr "SALES"

Congratulations! As you can see, read_sas() enables you to import SAS data files easily.

Import STATA data with haven

Next up are STATA data files; you can use read_dta() for these.

When inspecting the result of the read_dta() call, you will notice that one column will be imported as a labelled vector, an R equivalent for the common data structure in other statistical environments. In order to effectively continue working on the data in R, it’s best to change this data into a standard R class. To convert a variable of the class labelled to a factor, you’ll need haven’s as_factor() function.

In this exercise, you will work with data on yearly import and export numbers of sugar, both in USD and in weight. The data can be found here

# haven is already loaded

# Import the data from the URL: sugar
sugar <- read_dta("http://assets.datacamp.com/production/course_1478/datasets/trade.dta")

# Structure of sugar
str(sugar)

## tibble [10 x 5] (S3: tbl_df/tbl/data.frame)
##  $ Date    : dbl+lbl [1:10] 10,  9,  8,  7,  6,  5,  4,  3,  2,  1
##    ..@ label       : chr "Date"
##    ..@ format.stata: chr "%9.0g"
##    ..@ labels      : Named num [1:10] 1 2 3 4 5 6 7 8 9 10
##    .. ..- attr(*, "names")= chr [1:10] "2004-12-31" "2005-12-31" "2006-12-31" "2007-12-31" ...
##  $ Import  : num [1:10] 37664782 16316512 11082246 35677943 9879878 ...
##   ..- attr(*, "label")= chr "Import"
##   ..- attr(*, "format.stata")= chr "%9.0g"
##  $ Weight_I: num [1:10] 54029106 21584365 14526089 55034932 14806865 ...
##   ..- attr(*, "label")= chr "Weight_I"
##   ..- attr(*, "format.stata")= chr "%9.0g"
##  $ Export  : num [1:10] 5.45e+07 1.03e+08 3.79e+07 4.85e+07 7.15e+07 ...
##   ..- attr(*, "label")= chr "Export"
##   ..- attr(*, "format.stata")= chr "%9.0g"
##  $ Weight_E: num [1:10] 9.34e+07 1.58e+08 8.80e+07 1.12e+08 1.32e+08 ...
##   ..- attr(*, "label")= chr "Weight_E"
##   ..- attr(*, "format.stata")= chr "%9.0g"
##  - attr(*, "label")= chr "Written by R."

# Convert values in Date column to dates
sugar$Date <- as.Date(as_factor(sugar$Date))

# Structure of sugar again
str(sugar)

## tibble [10 x 5] (S3: tbl_df/tbl/data.frame)
##  $ Date    : Date[1:10], format: "2013-12-31" "2012-12-31" ...
##  $ Import  : num [1:10] 37664782 16316512 11082246 35677943 9879878 ...
##   ..- attr(*, "label")= chr "Import"
##   ..- attr(*, "format.stata")= chr "%9.0g"
##  $ Weight_I: num [1:10] 54029106 21584365 14526089 55034932 14806865 ...
##   ..- attr(*, "label")= chr "Weight_I"
##   ..- attr(*, "format.stata")= chr "%9.0g"
##  $ Export  : num [1:10] 5.45e+07 1.03e+08 3.79e+07 4.85e+07 7.15e+07 ...
##   ..- attr(*, "label")= chr "Export"
##   ..- attr(*, "format.stata")= chr "%9.0g"
##  $ Weight_E: num [1:10] 9.34e+07 1.58e+08 8.80e+07 1.12e+08 1.32e+08 ...
##   ..- attr(*, "label")= chr "Weight_E"
##   ..- attr(*, "format.stata")= chr "%9.0g"
##  - attr(*, "label")= chr "Written by R."

Import SPSS data with haven

The haven package can also import data files from SPSS. Again, importing the data is pretty straightforward. Depending on the SPSS data file you’re working with, you’ll need either read_sav() - for .sav files - or read_por() - for .por files.

In this exercise, you will work with data on four of the Big Five personality traits for 434 persons (Source: University of Bath). The Big Five is a psychological concept including, originally, five dimensions of personality to classify human personality. The SPSS dataset is called person.sav and is available in your working directory.

# haven is already loaded

# Import person.sav: traits
traits <- read_sav("_data/person.sav")

# Summarize traits
summary(traits)

##     Neurotic      Extroversion   Agreeableness   Conscientiousness
##  Min.   : 0.00   Min.   : 5.00   Min.   :15.00   Min.   : 7.00    
##  1st Qu.:18.00   1st Qu.:26.00   1st Qu.:39.00   1st Qu.:25.00    
##  Median :24.00   Median :31.00   Median :45.00   Median :30.00    
##  Mean   :23.63   Mean   :30.23   Mean   :44.55   Mean   :30.85    
##  3rd Qu.:29.00   3rd Qu.:34.00   3rd Qu.:50.00   3rd Qu.:36.00    
##  Max.   :44.00   Max.   :65.00   Max.   :73.00   Max.   :58.00    
##  NA's   :14      NA's   :16      NA's   :19      NA's   :14

# Print out a subset
subset(traits, Extroversion > 40 & Agreeableness > 40)

## # A tibble: 8 x 4
##   Neurotic Extroversion Agreeableness Conscientiousness
##      <dbl>        <dbl>         <dbl>             <dbl>
## 1       38           43            49                29
## 2       20           42            46                31
## 3       18           42            49                31
## 4       42           43            44                29
## 5       30           42            51                24
## 6       18           42            50                25
## 7       27           45            55                23
## 8       18           43            57                34

Great! You imported a data file from SPSS correctly using read_sav(). Deepen your knowledge in the following exercise!

Factorize, round two

In the last exercise you learned how to import a data file using the command read_sav(). With SPSS data files, it can also happen that some of the variables you import have the labelled class. This is done to keep all the labelling information that was originally present in the .sav and .por files. It’s advised to coerce (or change) these variables to factors or other standard R classes.

The data for this exercise involves information on employees and their demographic and economic attributes (Source: QRiE). The data can be found here

# haven is already loaded

# Import SPSS data from the URL: work
work <- read_sav("http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/employee.sav")

# Display summary of work$GENDER
summary(work$GENDER)

##    Length     Class      Mode 
##       474 character character

# Convert work$GENDER to a factor
work$GENDER <- as_factor(work$GENDER)

# Display summary of work$GENDER again
summary(work$GENDER)

## Female   Male 
##    216    258

Well done! The as_factor() function has more arguments you can specify, have a look at its documentation to discover more! Proceed to the next video.

foreign

R core team
less consistent
very comprehensive
all kinds of foreign data formats
SAS, STATA, SPSS, Systat, Weka …

SAS

cannot import .sas7bdat
only SAS libraries: .xport
sas7bdat package

STATA

STATA 5 to 12
read.dta()

SPSS

read.spss()

Import STATA data with foreign (1)

The foreign package offers a simple function to import and read STATA data: read.dta().

In this exercise, you will import data on the US presidential elections in the year 2000. The data in florida.dta contains the total numbers of votes for each of the four candidates as well as the total number of votes per election area in the state of Florida (Source: Florida Department of State). The file is available in your working directory, you can download it here if you want to experiment some more.

# Load the foreign package
library(foreign)

# Import florida.dta and name the resulting data frame florida
florida <- read.dta("_data/florida.dta")

# Check tail() of florida
tail(florida)

##     gore  bush buchanan nader  total
## 62  2647  4051       27    59   6784
## 63  1399  2326       26    29   3780
## 64 97063 82214      396  2436 182109
## 65  3835  4511       46   149   8541
## 66  5637 12176      120   265  18198
## 67  2796  4983       88    93   7960

Congratulations! Your first exercise on using foreign to import STATA data is complete. Proceed to the next exercise to learn the proper handling of the arguments that read.dta() includes!

Import STATA data with foreign (2)

Data can be very diverse, going from character vectors to categorical variables, dates and more. It’s in these cases that the additional arguments of read.dta() will come in handy.

The arguments you will use most often are convert.dates, convert.factors, missing.type and convert.underscore. Their meaning is pretty straightforward, as Filip explained in the video. It’s all about correctly converting STATA data to standard R data structures. Type ?read.dta to find out about about the default values.

The dataset for this exercise contains socio-economic measures and access to education for different individuals (Source: World Bank). This data is available as edequality.dta, which is located in the _data folder in your working directory.

# foreign is already loaded

# Specify the file path using file.path(): path
path <- file.path("_data", "edequality.dta")

# Create and print structure of edu_equal_1
edu_equal_1 <- read.dta(path)
str(edu_equal_1)

## 'data.frame':    12214 obs. of  27 variables:
##  $ hhid              : num  1 1 1 2 2 3 4 4 5 6 ...
##  $ hhweight          : num  627 627 627 627 627 ...
##  $ location          : Factor w/ 2 levels "urban location",..: 1 1 1 1 1 2 2 2 1 1 ...
##  $ region            : Factor w/ 9 levels "Sofia city","Bourgass",..: 8 8 8 9 9 4 4 4 8 8 ...
##  $ ethnicity_head    : Factor w/ 4 levels "Bulgaria","Turks",..: 2 2 2 1 1 1 1 1 1 1 ...
##  $ age               : num  37 11 8 73 70 75 79 80 82 83 ...
##  $ gender            : Factor w/ 2 levels "male","female": 2 2 1 1 2 1 1 2 2 2 ...
##  $ relation          : Factor w/ 9 levels "head                      ",..: 1 3 3 1 2 1 1 2 1 1 ...
##  $ literate          : Factor w/ 2 levels "no","yes": 1 2 2 2 2 2 2 2 2 2 ...
##  $ income_mnt        : num  13.3 13.3 13.3 142.5 142.5 ...
##  $ income            : num  160 160 160 1710 1710 ...
##  $ aggregate         : num  1042 1042 1042 3271 3271 ...
##  $ aggr_ind_annual   : num  347 347 347 1635 1635 ...
##  $ educ_completed    : int  2 4 4 4 3 3 3 3 4 4 ...
##  $ grade_complete    : num  4 3 0 3 4 4 4 4 5 5 ...
##  $ grade_all         : num  4 11 8 11 8 8 8 8 13 13 ...
##  $ unemployed        : int  2 1 1 1 1 1 1 1 1 1 ...
##  $ reason_OLF        : int  NA NA NA 3 3 3 9 9 3 3 ...
##  $ sector            : int  NA NA NA NA NA NA 1 1 NA NA ...
##  $ occupation        : int  NA NA NA NA NA NA 5 5 NA NA ...
##  $ earn_mont         : num  0 0 0 0 0 0 20 20 0 0 ...
##  $ earn_ann          : num  0 0 0 0 0 0 240 240 0 0 ...
##  $ hours_week        : num  NA NA NA NA NA NA 30 35 NA NA ...
##  $ hours_mnt         : num  NA NA NA NA NA ...
##  $ fulltime          : int  NA NA NA NA NA NA 1 1 NA NA ...
##  $ hhexp             : num  100 100 100 343 343 ...
##  $ legacy_pension_amt: num  NA NA NA NA NA NA NA NA NA NA ...
##  - attr(*, "datalabel")= chr ""
##  - attr(*, "time.stamp")= chr ""
##  - attr(*, "formats")= chr [1:27] "%9.0g" "%9.0g" "%9.0g" "%9.0g" ...
##  - attr(*, "types")= int [1:27] 100 100 108 108 108 100 108 108 108 100 ...
##  - attr(*, "val.labels")= chr [1:27] "" "" "location" "region" ...
##  - attr(*, "var.labels")= chr [1:27] "hhid" "hhweight" "location" "region" ...
##  - attr(*, "expansion.fields")=List of 12
##   ..$ : chr [1:3] "_dta" "_svy_su1" "cluster"
##   ..$ : chr [1:3] "_dta" "_svy_strata1" "strata"
##   ..$ : chr [1:3] "_dta" "_svy_stages" "1"
##   ..$ : chr [1:3] "_dta" "_svy_version" "2"
##   ..$ : chr [1:3] "_dta" "__XijVarLabcons" "(sum) cons"
##   ..$ : chr [1:3] "_dta" "ReS_Xij" "cons"
##   ..$ : chr [1:3] "_dta" "ReS_str" "0"
##   ..$ : chr [1:3] "_dta" "ReS_j" "group"
##   ..$ : chr [1:3] "_dta" "ReS_ver" "v.2"
##   ..$ : chr [1:3] "_dta" "ReS_i" "hhid dur"
##   ..$ : chr [1:3] "_dta" "note1" "variables g1pc, g2pc, g3pc, g4pc, g5pc, g7pc, g8pc, g9pc, g10pc, g11pc, g12pc,  gall, health, rent, durables we"| __truncated__
##   ..$ : chr [1:3] "_dta" "note0" "1"
##  - attr(*, "version")= int 7
##  - attr(*, "label.table")=List of 12
##   ..$ location: Named int [1:2] 1 2
##   .. ..- attr(*, "names")= chr [1:2] "urban location" "rural location"
##   ..$ region  : Named int [1:9] 1 2 3 4 5 6 7 8 9
##   .. ..- attr(*, "names")= chr [1:9] "Sofia city" "Bourgass" "Varna" "Lovetch" ...
##   ..$ ethnic  : Named int [1:4] 1 2 3 4
##   .. ..- attr(*, "names")= chr [1:4] "Bulgaria" "Turks" "Roma" "Other"
##   ..$ s2_q2   : Named int [1:2] 1 2
##   .. ..- attr(*, "names")= chr [1:2] "male" "female"
##   ..$ s2_q3   : Named int [1:9] 1 2 3 4 5 6 7 8 9
##   .. ..- attr(*, "names")= chr [1:9] "head                      " "spouse/partner            " "child                     " "son/daughter-in-law       " ...
##   ..$ lit     : Named int [1:2] 1 2
##   .. ..- attr(*, "names")= chr [1:2] "no" "yes"
##   ..$         : Named int [1:4] 1 2 3 4
##   .. ..- attr(*, "names")= chr [1:4] "never attanded" "primary" "secondary" "postsecondary"
##   ..$         : Named int [1:2] 1 2
##   .. ..- attr(*, "names")= chr [1:2] "Not unemployed" "Unemployed"
##   ..$         : Named int [1:10] 1 2 3 4 5 6 7 8 9 10
##   .. ..- attr(*, "names")= chr [1:10] "student" "housewife/childcare" "in retirement" "illness, disability" ...
##   ..$         : Named int [1:10] 1 2 3 4 5 6 7 8 9 10
##   .. ..- attr(*, "names")= chr [1:10] "agriculture" "mining" "manufacturing" "utilities" ...
##   ..$         : Named int [1:5] 1 2 3 4 5
##   .. ..- attr(*, "names")= chr [1:5] "private company" "public works program" "government,public sector, army" "private individual" ...
##   ..$         : Named int [1:2] 1 2
##   .. ..- attr(*, "names")= chr [1:2] "no" "yes"

# Create and print structure of edu_equal_2
edu_equal_2 <- read.dta(path, convert.factors = FALSE)
str(edu_equal_2)

## 'data.frame':    12214 obs. of  27 variables:
##  $ hhid              : num  1 1 1 2 2 3 4 4 5 6 ...
##  $ hhweight          : num  627 627 627 627 627 ...
##  $ location          : int  1 1 1 1 1 2 2 2 1 1 ...
##  $ region            : int  8 8 8 9 9 4 4 4 8 8 ...
##  $ ethnicity_head    : int  2 2 2 1 1 1 1 1 1 1 ...
##  $ age               : num  37 11 8 73 70 75 79 80 82 83 ...
##  $ gender            : int  2 2 1 1 2 1 1 2 2 2 ...
##  $ relation          : int  1 3 3 1 2 1 1 2 1 1 ...
##  $ literate          : int  1 2 2 2 2 2 2 2 2 2 ...
##  $ income_mnt        : num  13.3 13.3 13.3 142.5 142.5 ...
##  $ income            : num  160 160 160 1710 1710 ...
##  $ aggregate         : num  1042 1042 1042 3271 3271 ...
##  $ aggr_ind_annual   : num  347 347 347 1635 1635 ...
##  $ educ_completed    : int  2 4 4 4 3 3 3 3 4 4 ...
##  $ grade_complete    : num  4 3 0 3 4 4 4 4 5 5 ...
##  $ grade_all         : num  4 11 8 11 8 8 8 8 13 13 ...
##  $ unemployed        : int  2 1 1 1 1 1 1 1 1 1 ...
##  $ reason_OLF        : int  NA NA NA 3 3 3 9 9 3 3 ...
##  $ sector            : int  NA NA NA NA NA NA 1 1 NA NA ...
##  $ occupation        : int  NA NA NA NA NA NA 5 5 NA NA ...
##  $ earn_mont         : num  0 0 0 0 0 0 20 20 0 0 ...
##  $ earn_ann          : num  0 0 0 0 0 0 240 240 0 0 ...
##  $ hours_week        : num  NA NA NA NA NA NA 30 35 NA NA ...
##  $ hours_mnt         : num  NA NA NA NA NA ...
##  $ fulltime          : int  NA NA NA NA NA NA 1 1 NA NA ...
##  $ hhexp             : num  100 100 100 343 343 ...
##  $ legacy_pension_amt: num  NA NA NA NA NA NA NA NA NA NA ...
##  - attr(*, "datalabel")= chr ""
##  - attr(*, "time.stamp")= chr ""
##  - attr(*, "formats")= chr [1:27] "%9.0g" "%9.0g" "%9.0g" "%9.0g" ...
##  - attr(*, "types")= int [1:27] 100 100 108 108 108 100 108 108 108 100 ...
##  - attr(*, "val.labels")= chr [1:27] "" "" "location" "region" ...
##  - attr(*, "var.labels")= chr [1:27] "hhid" "hhweight" "location" "region" ...
##  - attr(*, "expansion.fields")=List of 12
##   ..$ : chr [1:3] "_dta" "_svy_su1" "cluster"
##   ..$ : chr [1:3] "_dta" "_svy_strata1" "strata"
##   ..$ : chr [1:3] "_dta" "_svy_stages" "1"
##   ..$ : chr [1:3] "_dta" "_svy_version" "2"
##   ..$ : chr [1:3] "_dta" "__XijVarLabcons" "(sum) cons"
##   ..$ : chr [1:3] "_dta" "ReS_Xij" "cons"
##   ..$ : chr [1:3] "_dta" "ReS_str" "0"
##   ..$ : chr [1:3] "_dta" "ReS_j" "group"
##   ..$ : chr [1:3] "_dta" "ReS_ver" "v.2"
##   ..$ : chr [1:3] "_dta" "ReS_i" "hhid dur"
##   ..$ : chr [1:3] "_dta" "note1" "variables g1pc, g2pc, g3pc, g4pc, g5pc, g7pc, g8pc, g9pc, g10pc, g11pc, g12pc,  gall, health, rent, durables we"| __truncated__
##   ..$ : chr [1:3] "_dta" "note0" "1"
##  - attr(*, "version")= int 7
##  - attr(*, "label.table")=List of 12
##   ..$ location: Named int [1:2] 1 2
##   .. ..- attr(*, "names")= chr [1:2] "urban location" "rural location"
##   ..$ region  : Named int [1:9] 1 2 3 4 5 6 7 8 9
##   .. ..- attr(*, "names")= chr [1:9] "Sofia city" "Bourgass" "Varna" "Lovetch" ...
##   ..$ ethnic  : Named int [1:4] 1 2 3 4
##   .. ..- attr(*, "names")= chr [1:4] "Bulgaria" "Turks" "Roma" "Other"
##   ..$ s2_q2   : Named int [1:2] 1 2
##   .. ..- attr(*, "names")= chr [1:2] "male" "female"
##   ..$ s2_q3   : Named int [1:9] 1 2 3 4 5 6 7 8 9
##   .. ..- attr(*, "names")= chr [1:9] "head                      " "spouse/partner            " "child                     " "son/daughter-in-law       " ...
##   ..$ lit     : Named int [1:2] 1 2
##   .. ..- attr(*, "names")= chr [1:2] "no" "yes"
##   ..$         : Named int [1:4] 1 2 3 4
##   .. ..- attr(*, "names")= chr [1:4] "never attanded" "primary" "secondary" "postsecondary"
##   ..$         : Named int [1:2] 1 2
##   .. ..- attr(*, "names")= chr [1:2] "Not unemployed" "Unemployed"
##   ..$         : Named int [1:10] 1 2 3 4 5 6 7 8 9 10
##   .. ..- attr(*, "names")= chr [1:10] "student" "housewife/childcare" "in retirement" "illness, disability" ...
##   ..$         : Named int [1:10] 1 2 3 4 5 6 7 8 9 10
##   .. ..- attr(*, "names")= chr [1:10] "agriculture" "mining" "manufacturing" "utilities" ...
##   ..$         : Named int [1:5] 1 2 3 4 5
##   .. ..- attr(*, "names")= chr [1:5] "private company" "public works program" "government,public sector, army" "private individual" ...
##   ..$         : Named int [1:2] 1 2
##   .. ..- attr(*, "names")= chr [1:2] "no" "yes"

# Create and print structure of edu_equal_3
edu_equal_3 <- read.dta(path, convert.underscore = TRUE)
str(edu_equal_3)

## 'data.frame':    12214 obs. of  27 variables:
##  $ hhid              : num  1 1 1 2 2 3 4 4 5 6 ...
##  $ hhweight          : num  627 627 627 627 627 ...
##  $ location          : Factor w/ 2 levels "urban location",..: 1 1 1 1 1 2 2 2 1 1 ...
##  $ region            : Factor w/ 9 levels "Sofia city","Bourgass",..: 8 8 8 9 9 4 4 4 8 8 ...
##  $ ethnicity.head    : Factor w/ 4 levels "Bulgaria","Turks",..: 2 2 2 1 1 1 1 1 1 1 ...
##  $ age               : num  37 11 8 73 70 75 79 80 82 83 ...
##  $ gender            : Factor w/ 2 levels "male","female": 2 2 1 1 2 1 1 2 2 2 ...
##  $ relation          : Factor w/ 9 levels "head                      ",..: 1 3 3 1 2 1 1 2 1 1 ...
##  $ literate          : Factor w/ 2 levels "no","yes": 1 2 2 2 2 2 2 2 2 2 ...
##  $ income.mnt        : num  13.3 13.3 13.3 142.5 142.5 ...
##  $ income            : num  160 160 160 1710 1710 ...
##  $ aggregate         : num  1042 1042 1042 3271 3271 ...
##  $ aggr.ind.annual   : num  347 347 347 1635 1635 ...
##  $ educ.completed    : int  2 4 4 4 3 3 3 3 4 4 ...
##  $ grade.complete    : num  4 3 0 3 4 4 4 4 5 5 ...
##  $ grade.all         : num  4 11 8 11 8 8 8 8 13 13 ...
##  $ unemployed        : int  2 1 1 1 1 1 1 1 1 1 ...
##  $ reason.OLF        : int  NA NA NA 3 3 3 9 9 3 3 ...
##  $ sector            : int  NA NA NA NA NA NA 1 1 NA NA ...
##  $ occupation        : int  NA NA NA NA NA NA 5 5 NA NA ...
##  $ earn.mont         : num  0 0 0 0 0 0 20 20 0 0 ...
##  $ earn.ann          : num  0 0 0 0 0 0 240 240 0 0 ...
##  $ hours.week        : num  NA NA NA NA NA NA 30 35 NA NA ...
##  $ hours.mnt         : num  NA NA NA NA NA ...
##  $ fulltime          : int  NA NA NA NA NA NA 1 1 NA NA ...
##  $ hhexp             : num  100 100 100 343 343 ...
##  $ legacy.pension.amt: num  NA NA NA NA NA NA NA NA NA NA ...
##  - attr(*, "datalabel")= chr ""
##  - attr(*, "time.stamp")= chr ""
##  - attr(*, "formats")= chr [1:27] "%9.0g" "%9.0g" "%9.0g" "%9.0g" ...
##  - attr(*, "types")= int [1:27] 100 100 108 108 108 100 108 108 108 100 ...
##  - attr(*, "val.labels")= chr [1:27] "" "" "location" "region" ...
##  - attr(*, "var.labels")= chr [1:27] "hhid" "hhweight" "location" "region" ...
##  - attr(*, "expansion.fields")=List of 12
##   ..$ : chr [1:3] "_dta" "_svy_su1" "cluster"
##   ..$ : chr [1:3] "_dta" "_svy_strata1" "strata"
##   ..$ : chr [1:3] "_dta" "_svy_stages" "1"
##   ..$ : chr [1:3] "_dta" "_svy_version" "2"
##   ..$ : chr [1:3] "_dta" "__XijVarLabcons" "(sum) cons"
##   ..$ : chr [1:3] "_dta" "ReS_Xij" "cons"
##   ..$ : chr [1:3] "_dta" "ReS_str" "0"
##   ..$ : chr [1:3] "_dta" "ReS_j" "group"
##   ..$ : chr [1:3] "_dta" "ReS_ver" "v.2"
##   ..$ : chr [1:3] "_dta" "ReS_i" "hhid dur"
##   ..$ : chr [1:3] "_dta" "note1" "variables g1pc, g2pc, g3pc, g4pc, g5pc, g7pc, g8pc, g9pc, g10pc, g11pc, g12pc,  gall, health, rent, durables we"| __truncated__
##   ..$ : chr [1:3] "_dta" "note0" "1"
##  - attr(*, "version")= int 7
##  - attr(*, "label.table")=List of 12
##   ..$ location: Named int [1:2] 1 2
##   .. ..- attr(*, "names")= chr [1:2] "urban location" "rural location"
##   ..$ region  : Named int [1:9] 1 2 3 4 5 6 7 8 9
##   .. ..- attr(*, "names")= chr [1:9] "Sofia city" "Bourgass" "Varna" "Lovetch" ...
##   ..$ ethnic  : Named int [1:4] 1 2 3 4
##   .. ..- attr(*, "names")= chr [1:4] "Bulgaria" "Turks" "Roma" "Other"
##   ..$ s2_q2   : Named int [1:2] 1 2
##   .. ..- attr(*, "names")= chr [1:2] "male" "female"
##   ..$ s2_q3   : Named int [1:9] 1 2 3 4 5 6 7 8 9
##   .. ..- attr(*, "names")= chr [1:9] "head                      " "spouse/partner            " "child                     " "son/daughter-in-law       " ...
##   ..$ lit     : Named int [1:2] 1 2
##   .. ..- attr(*, "names")= chr [1:2] "no" "yes"
##   ..$         : Named int [1:4] 1 2 3 4
##   .. ..- attr(*, "names")= chr [1:4] "never attanded" "primary" "secondary" "postsecondary"
##   ..$         : Named int [1:2] 1 2
##   .. ..- attr(*, "names")= chr [1:2] "Not unemployed" "Unemployed"
##   ..$         : Named int [1:10] 1 2 3 4 5 6 7 8 9 10
##   .. ..- attr(*, "names")= chr [1:10] "student" "housewife/childcare" "in retirement" "illness, disability" ...
##   ..$         : Named int [1:10] 1 2 3 4 5 6 7 8 9 10
##   .. ..- attr(*, "names")= chr [1:10] "agriculture" "mining" "manufacturing" "utilities" ...
##   ..$         : Named int [1:5] 1 2 3 4 5
##   .. ..- attr(*, "names")= chr [1:5] "private company" "public works program" "government,public sector, army" "private individual" ...
##   ..$         : Named int [1:2] 1 2
##   .. ..- attr(*, "names")= chr [1:2] "no" "yes"

Great! Can you tell the difference between the different versions of read.dta()? For this data, the first version (where you simply specified the file path) will be most useful to work with. Head over to the next exercise to see if you actually understand your data!

Import SPSS data with foreign (1)

All great things come in pairs. Where foreign provided read.dta() to read Stata data, there’s also read.spss() to read SPSS data files. To get a data frame, make sure to set to.data.frame = TRUE inside read.spss().

In this exercise, you’ll be working with socio-economic variables from different countries (Source: Quantative Data Analysis in Education). The SPSS data is in a file called international.sav, which is in your working directory. You can also download it here if you want to play around with it some more.

# foreign is already loaded

# Import international.sav as a data frame: demo
demo <- read.spss("_data/international.sav", to.data.frame = TRUE)

## re-encoding from CP1252

# Create boxplot of gdp variable of demo
boxplot(demo$gdp)

Import SPSS data with foreign (2)

In the previous exercise, you used the to.data.frame argument inside read.spss(). There are many other ways in which to customize the way your SPSS data is imported.

In this exercise you will experiment with another argument, use.value.labels. It specifies whether variables with value labels should be converted into R factors with levels that are named accordingly. The argument is TRUE by default which means that so called labelled variables inside SPSS are converted to factors inside R.

You’ll again be working with the international.sav data, which is available in your current working directory.

# foreign is already loaded

# Import international.sav as demo_1
demo_1 <- read.spss("_data/international.sav", to.data.frame = TRUE)

## re-encoding from CP1252

# Print out the head of demo_1
head(demo_1)

##   id              country  contint m_illit f_illit lifeexpt  gdp
## 1  1 Argentina            Americas     3.0     3.0       16 3375
## 2  2 Benin                  Africa    45.2    74.5        7  521
## 3  3 Burundi                Africa    33.2    48.1        5   86
## 4  4 Chile                Americas     4.2     4.4       14 4523
## 5  5 Dominican Republic   Americas    12.0    12.7       12 2408
## 6  6 El Salvador          Americas    17.6    22.9       11 2302

# Import international.sav as demo_2
demo_2 <- read.spss("_data/international.sav", to.data.frame = TRUE, use.value.labels = FALSE)

## re-encoding from CP1252

# Print out the head of demo_2
head(demo_2)

##   id              country contint m_illit f_illit lifeexpt  gdp
## 1  1 Argentina                  2     3.0     3.0       16 3375
## 2  2 Benin                      1    45.2    74.5        7  521
## 3  3 Burundi                    1    33.2    48.1        5   86
## 4  4 Chile                      2     4.2     4.4       14 4523
## 5  5 Dominican Republic         2    12.0    12.7       12 2408
## 6  6 El Salvador                2    17.6    22.9       11 2302

Congratulations on finishing part 2 of Importing Data in R! See you next time!